Modification of Probability Distributions Applied to Word Length Research

Authors

  • Gejza Wimmer
  • Viktor Witkovský
  • Gabriel Altmann
Abstract

In linguistic modelling, a number of probability distributions must be modified because of different individual influences on the data and gradual shifts to new attractors. Several kinds of modifications, estimations and tests of already published models are presented for fitting purposes.

*Address correspondence to: Gejza Wimmer, Mathematical Institute SAS, Stefánikova 49, SK81473 Bratislava, Slovakia
1Supported by the Slovak Grant Agency, grants 1/4196/99, 2/5126/99

[...]mann, 1996) and to use the same model for a set of homogeneous data. Unfortunately, it is not always possible to capture all data merely by variation of parameters. One always finds exceptions which are understood as signals of the author’s effort to leave the familiar attractor. If the deviations are small, researchers, with or without knowledge of the causes (which are hard to identify in texts), use techniques of local or global modification of distributions. This practice is found in all sciences and corresponds to the usual research process (cf. Lakatos, 1974): one maintains a theory until a new one replaces it and explains more than the original one. Many researchers use several models simultaneously because they examine mixed data. But even for homogeneous data they sometimes present several models, as far as the existing software allows, because there is no special reason to choose a particular model inductively. In this contribution we want to show the techniques of modification as applied to models of word length distribution. In this domain there are so many modifications that it is worth treating them in a unified way.

STATISTICAL APPROACH AND MODELLING

When measuring word length we have the following situation: in the given text of length N (or in a sample of size N from a dictionary), we measure the values y1, y2, ..., yN, where yj is the length of the jth word (j = 1, 2, ..., N) measured in terms of syllable, phoneme or letter numbers. Let Xi be the random variable ‘number of words of length i’ in the text or dictionary, i = 1, 2, ..., k (in some cases there is also i = 0). Then by simple summation over the values y1, y2, ..., yN we obtain the numbers ni, designating the frequencies of words of length i in the text, i = 1, 2, ..., k. The vector (n1, n2, ..., nk) is the realisation of the random vector (X1, X2, ..., Xk) in the given text (or dictionary), k is the greatest word length, and N is the sample size (e.g., text length, sample size from the dictionary, etc.).

The joint distribution of the vector (X1, X2, ..., Xk) is considered to be multinomial with parameters N, π1, π2, ..., πk, i.e.,

$$P(X_1 = n_1, X_2 = n_2, \ldots, X_k = n_k) = \frac{N!}{n_1!\, n_2! \cdots n_k!}\, \pi_1^{n_1} \pi_2^{n_2} \cdots \pi_k^{n_k} \qquad (1)$$

where ni ∈ {0, 1, ..., N}, i = 1, 2, ..., k and n1 + n2 + ... + nk = N. The numbers π1, π2, ..., πk are the (theoretical) probabilities of the occurrences of words of length i (i = 1, 2, ..., k) and it holds that 0 < πi < 1 and π1 + π2 + ... + πk = 1. However, in accordance with previous research we assume that {π1, π2, ..., πk} can be represented by different models that can be derived from appropriate approaches (cf. Wimmer et al., 1994; Wimmer & Altmann, 1996) under the assumption that the same model holds for ‘homogeneous’ data. The hypothesis that a certain model holds can be expressed mathematically as

$$H_0:\ \pi_i = \pi_i(\theta), \quad i = 1, 2, \ldots, k, \quad \theta \in A \subset R^m$$

where θ is the vector of parameters, A is the parameter space and R^m the m-dimensional Euclidean space. An example of such a hypothesis is, e.g., Ammermann’s (1997) use of the positive negative binomial distribution in the form [...] where the probability of the greatest length class in the data, k, is determined as [...]. A hypothesis of this kind can be tested. There are a number of test statistics with asymptotic properties, which do not inspire much confidence in empirical researchers who have only small samples. Unfortunately, they still have other disadvantages of a mathematical nature which could not be overcome as yet.

Let us consider in greater detail the test procedure, starting from the assumption that if H0 is valid, a vector θ0 ∈ A ⊂ R^m exists such that πi = πi(θ0), i = 1, 2, ..., k. Under relatively general conditions on A ⊂ R^m, πi(θ), ∂πi(θ)/∂θj (cf., e.g., Rao, 1973, 5e.2, 6a.2, 6b) one can show that for the so-called maximum likelihood (ML) estimator, θ̂ = θ̂(X1, X2, ..., Xk), being the solution of the ML-equations [...] (2), it holds that for N → ∞ ...
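
The procedure described above (tabulating the frequencies ni, treating them as multinomial, estimating θ by maximum likelihood and computing an asymptotic test statistic) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the model is a hypothetical stand-in (a 1-displaced Poisson with a single parameter a, whose tail mass is pooled into the last class k), not Ammermann's positive negative binomial form, and Pearson's chi-square is used merely as one familiar example of an asymptotic statistic, since the text does not single out a particular one. All function names and the sample data are invented for the example.

```python
import math
from collections import Counter


def length_frequencies(lengths):
    """Tabulate n_1, ..., n_k from the observed word lengths y_1, ..., y_N."""
    counts = Counter(lengths)
    k = max(counts)  # greatest observed word length
    return [counts.get(i, 0) for i in range(1, k + 1)]


def displaced_poisson_probs(a, k):
    """Hypothetical stand-in model (an assumption, not the paper's model):
    pi_i = exp(-a) * a**(i-1) / (i-1)! for i < k, tail mass pooled into class k."""
    head = [math.exp(-a) * a ** (i - 1) / math.factorial(i - 1) for i in range(1, k)]
    tail = max(1.0 - math.fsum(head), 1e-300)  # guard against rounding to <= 0
    return head + [tail]


def multinomial_log_pmf(freqs, probs):
    """Logarithm of the multinomial probability of the frequency vector."""
    n_total = sum(freqs)
    log_coef = math.lgamma(n_total + 1) - sum(math.lgamma(n + 1) for n in freqs)
    return log_coef + sum(n * math.log(p) for n, p in zip(freqs, probs) if n > 0)


def fit_ml(freqs, lo=1e-6, hi=20.0, tol=1e-9):
    """Golden-section search for the a that maximises the log-likelihood on
    [lo, hi]; assumes a single maximum inside the bracket."""
    k = len(freqs)

    def loglik(a):
        return multinomial_log_pmf(freqs, displaced_poisson_probs(a, k))

    ratio = (math.sqrt(5) - 1) / 2
    while hi - lo > tol:
        c = hi - ratio * (hi - lo)
        d = lo + ratio * (hi - lo)
        if loglik(c) < loglik(d):
            lo = c  # maximum lies to the right of c
        else:
            hi = d  # maximum lies to the left of d
    return (lo + hi) / 2


def pearson_chi_square(freqs, probs):
    """Sum over classes of (observed - expected)^2 / expected."""
    n_total = sum(freqs)
    return sum((n - n_total * p) ** 2 / (n_total * p) for n, p in zip(freqs, probs))


# Made-up sample of N = 100 word lengths, for illustration only.
lengths = [1] * 30 + [2] * 40 + [3] * 20 + [4] * 10
freqs = length_frequencies(lengths)
a_hat = fit_ml(freqs)
probs = displaced_poisson_probs(a_hat, len(freqs))
chi2 = pearson_chi_square(freqs, probs)
print(freqs, round(a_hat, 3), round(chi2, 3))
```

With k classes and m estimated parameters (here m = 1), the statistic would usually be compared against a chi-square distribution with k - 1 - m degrees of freedom. That comparison rests on exactly the asymptotic argument the text cautions about for small samples.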

Similar resources

Bertrand’s Paradox Revisited: More Lessons about that Ambiguous Word, Random

The Bertrand paradox question is: “Consider a unit-radius circle for which the length of a side of an inscribed equilateral triangle equals √3. Determine the probability that the length of a ‘random’ chord of a unit-radius circle has length greater than √3.” Bertrand derived three different ‘correct’ answers, the correctness depending on interpretation of the word, random. Here we employ geomet...

Shortest Path Problem with Gamma Probability Distribution Arc Length

We propose a dynamic program to find the shortest path in a network having gamma probability distributions as arc lengths. Two operators of sum and comparison need to be adapted for the proposed dynamic program. Convolution approach is used to sum two gamma probability distributions being employed in the dynamic program.

Distinct word length frequencies: distributions and symbol entropies

The distribution of frequency counts of distinct words by length in a language’s vocabulary will be analyzed using two methods. The first will look at the empirical distributions of several languages and derive a distribution that reasonably explains the number of distinct words as a function of length. We will be able to derive the frequency count, mean word length, and variance of word lengt...

A risk adjusted self-starting Bernoulli CUSUM control chart with dynamic probability control limits

Usually, in monitoring schemes the nominal value of the process parameter is assumed known. However, this assumption is violated owing to costly sampling and lack of data, particularly in healthcare systems. On the other hand, applying a fixed control limit for the risk-adjusted Bernoulli chart leads to a variable in-control average run length performance for patient populations with dissimilar...

Simulation of Fixed Length Word String Probability Distributions

I. WORD STRING DISTRIBUTION SIMULATION. The aim of this description is to provide a framework for simulating specific word string distributions whose general structure follows the structure of word string posterior distributions relevant for string recognition. For the specific case of string classes w_1^N = w_1, w_2, ..., w_N of fixed length N with words w_n ∈ 𝒱, with vocabulary size V = |𝒱|, and...

Journal title:
  • Journal of Quantitative Linguistics

Volume 6  Issue

Pages  -

Publication date 1999